1 Introduction

This paper is a welcome addition to the growing literature on preferential sampling 
in a geostatistical setting. Earlier papers cited by the authors have shown that
preferential sampling materially affects parameter estimation and prediction. The authors
now demonstrate that the same applies to design, or more specifically to the optimal 
augmentation of an initial set of geostatistical data that has been sampled
preferentially. Almost in passing, the paper also sets out an algorithm for Bayesian inference
under preferential sampling that is a useful contribution in its own right. Might we look 
forward to an R package implementation of this? 

Our comments fall into two categories: theoretical remarks on what we call adaptive
design, including an explanation of why this does not necessarily require consideration
of preferential sampling issues; and practical constraints that may limit the scope for
theoretically optimal designs to be used in practice, especially in low-resource settings.

2 Adaptive geostatistical design and preferential sampling

The topic of geostatistical design is multi-faceted. One useful distinction is between 
adaptive and non-adaptive designs. A non-adaptive design is one that is completely 
determined before any data are collected. An adaptive design is one in which an initial 
design is augmented in a way that depends on the analysis of interim data. We make 
two theoretical comments that follow from the definition of preferential sampling given 
in Diggle, Menezes and Su (2010). 

Firstly, an adaptive design need not be preferential. To see why, it is sufficient
to consider a two-stage adaptive design, $X = (X_0, X_1)$, with associated measurement
data $Y = (Y_0, Y_1)$, where subscripts 0 and 1 identify initial and follow-up stages, respectively.
Similarly, write $S = (S_0, S_1)$ for the corresponding decomposition of the latent process $S$.
Quite generally, we can factorise the joint distribution of $(X, Y, S)$ as

\[
\begin{aligned}
{}[X, Y, S] &= [S, X_0, Y_0, X_1, Y_1] \\
&= [S]\,[X_0 \mid S]\,[Y_0 \mid X_0, S]\,[X_1 \mid Y_0, X_0, S]\,[Y_1 \mid X_1, Y_0, X_0, S].
\end{aligned}
\tag{1}
\]
On the right-hand side of (1), if the initial design is non-preferential, $[X_0 \mid S] = [X_0]$,
whilst by construction $[X_1 \mid Y_0, X_0, S] = [X_1 \mid Y_0, X_0]$. It then follows that

\[
\begin{aligned}
{}[X, Y] &= [X_0]\,[X_1 \mid X_0, Y_0] \times \int_S [Y_0 \mid X_0, S]\,[Y_1 \mid X_1, Y_0, X_0, S]\,[S]\, dS \\
&= [X \mid Y_0] \times [Y \mid X]
\end{aligned}
\tag{2}
\]

and the log-likelihood is a sum of two components, $\log[X \mid Y_0] + \log[Y \mid X]$. This shows
that the conditional likelihood, $[Y \mid X]$, can legitimately be used for inference although,
depending on how $[X \mid Y_0]$ is specified, to do so may be inefficient. The argument leading
to (2) is closely related to the proof that if data are “missing at random” the missingness 
mechanism can be ignored when using likelihood-based inference (Rubin, 1976), and 
extends to multi-stage adaptive designs with essentially only notational changes. 
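
A toy enumeration can make the role of the design term in (2) concrete. The sketch below is not taken from the paper under discussion. It assumes a hypothetical two-site binary latent process, a fixed initial site, and made-up selection probabilities. By exact summation over $S$ it checks that the ratio of the joint likelihood $[X, Y]$ to the conditional likelihood $[Y \mid X]$ is free of the latent-process parameter `p` when the follow-up site is chosen from $Y_0$ alone, but not when it is chosen from $S$ itself. All function names are illustrative.

```python
from itertools import product

STATES = list(product((0, 1), repeat=2))  # latent values (S_a, S_b) at two sites


def prior(s, p, rho=0.9):
    """[S]: P(S_a = 1) = p, and S_b equals S_a with probability rho."""
    p_a = p if s[0] == 1 else 1.0 - p
    p_b = rho if s[1] == s[0] else 1.0 - rho
    return p_a * p_b


def obs(y, s_val, err=0.1):
    """[Y | X, S]: the latent value is recorded wrongly with probability err."""
    return 1.0 - err if y == s_val else err


def adaptive_rule(x1, y0, s):
    """[X_1 | Y_0]: site b (x1 = 1) is favoured when Y_0 = 1; S is ignored."""
    prob_b = 0.8 if y0 == 1 else 0.3
    return prob_b if x1 == 1 else 1.0 - prob_b


def preferential_rule(x1, y0, s):
    """[X_1 | S]: site b is favoured when S_b = 1, so the design depends on S."""
    prob_b = 0.8 if s[1] == 1 else 0.3
    return prob_b if x1 == 1 else 1.0 - prob_b


def joint(x1, y0, y1, p, rule):
    """[X, Y]: sum over S, with the initial site fixed at a (index 0)."""
    return sum(prior(s, p) * obs(y0, s[0]) * rule(x1, y0, s) * obs(y1, s[x1])
               for s in STATES)


def conditional(x1, y0, y1, p):
    """[Y | X] as in (2): the same sum with the design term left out."""
    return sum(prior(s, p) * obs(y0, s[0]) * obs(y1, s[x1]) for s in STATES)


def design_ratio(x1, y0, y1, p, rule):
    """Joint over conditional likelihood; equals [X_1 | Y_0] when (2) holds."""
    return joint(x1, y0, y1, p, rule) / conditional(x1, y0, y1, p)


if __name__ == "__main__":
    for name, rule in [("adaptive", adaptive_rule),
                       ("preferential", preferential_rule)]:
        ratios = [design_ratio(1, 1, 1, p, rule) for p in (0.2, 0.7)]
        print(name, ratios)
```

Under the adaptive rule the ratio stays at the selection probability 0.8 (up to rounding) whatever the value of `p`, so maximising the conditional likelihood over `p` gives the same answer as maximising the joint likelihood. Under the preferential rule the ratio varies with `p`, which is exactly the information that the conditional likelihood would discard.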

Secondly, shared dependence of a design $X$ and the latent process $S$ on observed
covariates does not necessarily render $X$ preferential. Specifically, if $Z$ denotes the
covariate process, then $[X, S \mid Z] = [S \mid Z]\,[X \mid S, Z]$. The requirement for the design to be
non-preferential is that $[X \mid S, Z] = [X \mid Z]$, which in general is a weaker requirement than
$[X \mid S] = [X]$. This illustrates, not for the first time, that spatial statistical inference can
be greatly simplified by judicious selection of spatially referenced covariates. 
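
The distinction between the two requirements can be checked on a small discrete example. The following sketch is an assumption-laden illustration, not part of the paper: $Z$, $S$ and $X$ are reduced to hypothetical binary variables (for instance a covariate class, a high or low latent value, and an indicator that a site is sampled), with invented probabilities. It shows a design that is non-preferential given $Z$ even though $X$ and $S$ are marginally dependent.

```python
def p_z(z):
    """[Z]: both covariate classes are equally likely."""
    return 0.5


def p_s_given_z(s, z):
    """[S | Z]: the latent value tends to be high when z = 1."""
    prob_one = 0.8 if z == 1 else 0.2
    return prob_one if s == 1 else 1.0 - prob_one


def p_x_given_z(x, z):
    """[X | Z]: sites with z = 1 are sampled more often; S plays no role."""
    prob_one = 0.9 if z == 1 else 0.3
    return prob_one if x == 1 else 1.0 - prob_one


def p_x_given_s(x, s):
    """[X | S], obtained by summing the joint [Z][S | Z][X | Z] over Z."""
    num = sum(p_z(z) * p_s_given_z(s, z) * p_x_given_z(x, z) for z in (0, 1))
    den = sum(p_z(z) * p_s_given_z(s, z) for z in (0, 1))
    return num / den


def p_x_given_s_z(x, s, z):
    """[X | S, Z], computed from the same joint distribution."""
    num = p_z(z) * p_s_given_z(s, z) * p_x_given_z(x, z)
    den = sum(p_z(z) * p_s_given_z(s, z) * p_x_given_z(xx, z) for xx in (0, 1))
    return num / den


if __name__ == "__main__":
    # Marginally, sampling is more likely where S is high ...
    print(p_x_given_s(1, 1), p_x_given_s(1, 0))
    # ... but conditionally on Z the value of S is irrelevant.
    for z in (0, 1):
        print(z, p_x_given_s_z(1, 0, z), p_x_given_s_z(1, 1, z))
```

In this toy setting an analysis that ignored $Z$ would face a design with $[X \mid S] \neq [X]$, whereas an analysis that conditions on $Z$ satisfies $[X \mid S, Z] = [X \mid Z]$.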


3 Some practical constraints on geostatistical design 

The paper makes a number of explicit and implicit assumptions that together provide a 
very reasonable framework for theoretical analysis, but it is worth bearing in mind that 
in any particular application, the design problem may be constrained in various ways. 
These assumptions include the following: 

1. The spatial integral of the predictive variance is an appropriate measure of predictive
performance

This would not be true if, for example, $S(x)$ represents pollution and the main
objective is to monitor compliance with environmental standards; see Fanshawe 
and Diggle (2013). 

2. Sampling may not be equally costly at every location 

Put another way, should the design be constrained by the number of locations to 
be sampled, or by the total sampling effort in the field? An obvious example of 
this is when travel-time represents a non-negligible proportion of field-effort; see, 
for example, Figures 2 and 4 of Diggle et al. (2007).

3. The number of potential sampling points may be finite 

This applies to disease prevalence surveys when the sampling unit is either a 
household or a well-defined community. We are currently working on the adaptive 
design of an ongoing malaria prevalence mapping project around the perimeter of 
the Majete national park, Malawi, where the first task has been to enumerate and 
geo-locate each household in each village within the study-region. In the course
of the project, we expect to sample all households, but the order in which they 
are sampled (in a sequence of monthly field-trips) will be chosen adaptively with 
the aim of optimising the estimation of the complete spatio-temporal variation in 
malaria prevalence, which is known to include a strong seasonal component. 
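
Constraints 2 and 3 can be combined in a simple computational sketch. The code below is an illustration under stated assumptions, not the design method of the paper or of the Majete project. It takes a finite list of candidate locations and a hypothetical positive cost per location, and assumes a stationary Gaussian process with exponential covariance whose parameters `sigma2`, `phi` and `tau2` are treated as known. It then greedily adds the candidate giving the largest reduction in summed predictive variance per unit cost until the budget is exhausted. The sum over candidates stands in for the spatial integral in point 1, and all names are illustrative.

```python
import math


def exp_cov(locs, sigma2=1.0, phi=1.0):
    """Exponential covariance sigma2 * exp(-d / phi) between all candidates."""
    n = len(locs)
    return [[sigma2 * math.exp(-math.dist(locs[i], locs[j]) / phi)
             for j in range(n)] for i in range(n)]


def condition_on(cov, j, tau2):
    """Covariance of S at all candidates after one noisy observation at j."""
    n = len(cov)
    denom = cov[j][j] + tau2
    col = [cov[a][j] for a in range(n)]
    return [[cov[a][b] - col[a] * col[b] / denom for b in range(n)]
            for a in range(n)]


def total_variance(cov):
    """Sum of predictive variances over the finite candidate set."""
    return sum(cov[i][i] for i in range(len(cov)))


def greedy_augment(locs, initial, costs, budget,
                   tau2=0.1, sigma2=1.0, phi=1.0):
    """Greedily add candidates by variance reduction per unit cost.

    Costs are assumed strictly positive. Returns the chosen indices and
    the remaining summed predictive variance.
    """
    n = len(locs)
    cov = exp_cov(locs, sigma2, phi)
    for j in initial:  # account for the initial design first
        cov = condition_on(cov, j, tau2)
    sampled, chosen, remaining = set(initial), [], budget
    while True:
        best, best_score = None, 0.0
        for j in range(n):
            if j in sampled or costs[j] > remaining:
                continue
            # Drop in total variance if candidate j were sampled next.
            reduction = sum(cov[a][j] ** 2 for a in range(n)) / (cov[j][j] + tau2)
            score = reduction / costs[j]
            if score > best_score:
                best, best_score = j, score
        if best is None:  # nothing affordable is left
            break
        cov = condition_on(cov, best, tau2)
        sampled.add(best)
        chosen.append(best)
        remaining -= costs[best]
    return chosen, total_variance(cov)


if __name__ == "__main__":
    locs = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
    costs = [1.0, 1.0, 2.0, 1.0, 3.0]  # hypothetical travel costs
    print(greedy_augment(locs, initial=[0], costs=costs, budget=3.0))
```

With known covariance parameters and a Gaussian model, the predictive variance does not depend on the interim measurements, so this criterion alone is not adaptive. In an adaptive design of the kind described above, the parameters would be re-estimated from the interim data before each new batch is chosen.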

None of these comments are intended to detract from the value of the paper
on its own terms. Theoretical studies of this kind help to further our understanding of 
important, and often subtle, methodological issues around modelling and inference for 
preferentially sampled geostatistical data. 